Social media users leave digital traces that can be leveraged to guide consumer purchasing decisions. Twitter users, in particular, provide temporal data that can be analyzed to develop a more precise understanding of users’ perspectives or sentiments about a service, company, or product. This information, in turn, can be shared with the public to allow customers to make decisions based on crowd-sourced knowledge. It may also even help the service provider understand how they can improve their services.
Imagine it is Friday in July and you are finishing up a long week at the office. The clock reads 4:00pm and you quickly race to pack up your work station. You need to get downtown to catch the 5pm MegaBus from NYC to Washington, DC. You are ready to start the weekend and have plans at 10pm in downtown DC. You navigate the NYC subway and break a sweat walking quickly down 34th Street, hoping that you are early enough to score an aisle seat. As you approach the bus stop you come into view of your worst nightmare: a line with hundreds of angry Megabus passengers waiting with no bus in sight. As you walk towards the end of the line another passenger mentions that the 3:30pm bus hasn’t arrived yet. A Megabus employee tells you he has no idea when the 5pm bus will depart (“Who knows…”). It is 90 degrees, you are sweating profusely, and you will surely miss your evening plans. You feel helpless. A wronged consumer with no recourse. Then you remember a social media site called Twitter…
Based on our personal experience and the experience of others, we aim to analyze recent tweets to understand the degree of positive and negative sentiment among users of the low-cost bus company, Megabus, over time. While Megabus is an inexpensive and convenient transportation option with many buses departing major cities each day, we hypothesized that some days or times of year may be more reliable, i.e., buses have fewer breakdowns and delays, than others. Using social media, data visualization, and statistics, we provide an informative and entertaining sentiment analysis of Megabus experiences from January 1, 2015 to April 1, 2016. Our work offers future bus riders information on the type of experience they can expect when riding the Megabus.
As a team, we were originally interested in building a predictive model for Megabus experience based on time of day, location and bus route. Based on the data that we were able to collect from Twitter, these questions evolved to focus on the relationship between Megabus passenger sentiment, the volume of Megabus tweets, day of the week, whether the tweet was posted on a weekday (Monday-Thursday) or a weekend (Friday-Sunday), and month of the year.
The questions we have answered as part of this analysis include the following:
1. How can we most effectively scrape Twitter for data related to Megabus sentiment?
2. What are the overall Megabus sentiment and the volume of Megabus tweets during the period from January 1, 2015 to April 1, 2016?
3. How do Megabus sentiment and tweet volume differ based on day of the week and month of the year?
4. What is the relationship between the volume of Megabus tweets on a given day and the overall sentiment we expect to see?
We aimed to obtain historical tweet data in order to observe trends in tweet volume and sentiment over time and identify associations with factors such as time of year, month, and day. Since there are many factors influencing tweet volume and sentiment, we wished to obtain the largest possible data set.
Initially, we set up an app (actually, three apps, one for each of us) using the Twitter API to obtain tweets containing the word “megabus.” This was a great learning experience: we remember fondly this effort as our first true collaboration on git, our first exposure to the Twitter API, the first time we set up an .Rprofile to store sensitive key variables, and the first tweet data we obtained. However, it became clear that the Twitter API would only provide us with tweets from the last week. This was insufficient for most of the trends we wished to consider.
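As a minimal sketch of the .Rprofile pattern (the option name here is illustrative, not the exact one we used), keys stay out of the committed scripts:

```r
# This line lives in ~/.Rprofile, which is never committed to git.
# (Shown here only so the example is self-contained.)
options(twitter_api_key = "example-key-not-a-real-secret")

# Analysis scripts then read the key back instead of hard-coding it:
api_key <- getOption("twitter_api_key")
```

Because .Rprofile is sourced at R startup, every script in the project can read the key without it ever appearing in the repository.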
A Google search led us to Jefferson Henrique’s Python package GetOldTweets-python. This package enabled us to obtain tweets from any date (at least, we did not encounter a date that was inaccessible). The original package prints tweets to the terminal; we modified the script so that it would write tweets to a csv file. As a result, we obtained over 100,000 tweets containing the word “megabus” from January 1, 2015 through April 1, 2016.
We conducted much of our exploratory analysis using only tweets from the first three months of 2015. Though some questions of interest pertained to the entire 15-month period, we wanted to make sure the data frame was not too large during our exploratory phase. There was initially some evidence that the data frame was too large because some commands were slow. However, we later identified that the only time-consuming command was an inefficiently written for loop that created several new columns. We rewrote it as a series of mutate() calls, and since then, wrangling the data frames takes only a few seconds from start to finish.
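The shape of that fix looks roughly like the following (a reconstruction for illustration, not our exact columns): a row-by-row for loop versus a single vectorized mutate() call that produces the same column.

```r
library(dplyr)

df_loop <- data.frame(hour = c(9, 14, 23, 7))

# Slow pattern: building a new column one row at a time.
df_loop$is_evening <- NA
for (i in seq_len(nrow(df_loop))) {
  df_loop$is_evening[i] <- ifelse(df_loop$hour[i] >= 18, 1, 0)
}

# Fast pattern: one vectorized mutate() over the whole column.
df_mutate <- data.frame(hour = c(9, 14, 23, 7)) %>%
  mutate(is_evening = ifelse(hour >= 18, 1, 0))
```

The loop touches the data frame once per row, while mutate() operates on the whole column at once, which is why the rewrite cut the wrangling time to seconds.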
In order to look at variations in sentiment over days and time, we parsed the combined date-time variable into two new variables containing just the date and time information. We also used the months() function to create a month column, the weekdays() function to create a column identifying the tweet’s day of the week, and another column encoding the tweet’s binary weekend status, with Friday counting as part of the weekend.
We cleaned the data using dplyr and the tidytext package (available here: https://github.com/juliasilge/tidytext), which helps with the text mining tasks necessary for sentiment analysis.
# Load the packages used for wrangling, text mining, visualization, and modeling.
# Note: MASS and Hmisc mask some dplyr functions, so dplyr::select() is called
# explicitly below.
library(dplyr)
library(tidyr)
library(readr)
library(data.table)
library(ggplot2)
library(ggthemes)
library(devtools)
library(tidytext)
library(Hmisc)
library(tm)
library(SnowballC)
library(wordcloud)
library(RColorBrewer)
library(igraph)
library(ggraph)
library(gapminder)
library(car)
library(MASS)
#Increase digits so that the many-digit tweet id does not cause a problem.
options(digits = 22)
tweets_df_all <- read_csv("Jan_2015-April_2016.csv")
#Change the column names.
names(tweets_df_all) <- c("id","username","text","date","geo","retweets","favorites","mentions","hashtags")
#get a subset of random lines from the full set to work with:
#tweets_df_all <- tweets_df_all[sample(1:nrow(tweets_df_all), 10000, replace=FALSE),]
tweets_df_all <- tweets_df_all %>%
  mutate(time = format(date, format = "%H:%M:%S")) %>%
  mutate(date2 = format(date, format = "%m-%d-%y")) %>%
  mutate(month = months(as.Date(date2, '%m-%d-%y'))) %>%
  mutate(weekend = weekdays(as.Date(date2, '%m-%d-%y'))) %>%
  mutate(weekend_binary = ifelse(weekend %in% c("Friday", "Saturday", "Sunday"), 1, 0)) %>%
  filter(date2 != "12-31-14") %>%
  filter(date < "2016-04-01")
# Filter out duplicate tweets; .keep_all = TRUE retains the remaining columns
# (required in dplyr >= 0.5, where distinct() otherwise drops them).
tweets_df_all <- tweets_df_all %>%
  distinct(id, .keep_all = TRUE)
#nrow(tweets_df_all)
tweets_df_all <- tweets_df_all %>%
  distinct(text, .keep_all = TRUE)
# explore number of tweets per user including megabus handles
prolific_tweeters_all <- tweets_df_all %>%
group_by(username) %>%
summarise(tweets = n()) %>%
arrange(desc(tweets))
# filter out tweets from megabus operators
tweets_df_all = tweets_df_all[!grepl("megabus|megabusuk|MegabusHelp|megabusit|megabusde|megabusGold", tweets_df_all$username),]
prolific_tweeters_filtered <- tweets_df_all %>%
group_by(username) %>%
summarise(tweets = n()) %>%
arrange(desc(tweets))
ggplot(filter(prolific_tweeters_filtered, tweets>0), aes(tweets)) +
geom_histogram(binwidth = 1) + xlab("Number of megabus tweets per user") + ylab("Number of users") + theme_hc()
ggplot(filter(prolific_tweeters_filtered, tweets>8), aes(tweets)) +
geom_histogram(binwidth = 1) + xlab("Number of megabus tweets per user") + ylab("Number of users") + theme_hc()
ggplot(data=tweets_df_all, aes(x=as.Date(date2,'%m-%d-%y'))) +
geom_histogram(aes(fill=..count..), binwidth=1) +
scale_x_date("Date") +
scale_y_continuous("Frequency")
#The outlying days with high tweet volume are:
tweets_df_all %>% group_by(date2) %>% count(date2, sort = TRUE) %>% filter(n>500)
## Source: local data frame [3 x 2]
##
## date2 n
## (chr) (int)
## 1 04-13-15 1022
## 2 02-21-16 986
## 3 05-13-15 519
#The outlying days with low tweet volume are:
tweets_df_all %>% group_by(date2) %>% count(date2, sort = TRUE) %>% filter(n<100)
## Source: local data frame [5 x 2]
##
## date2 n
## (chr) (int)
## 1 01-27-16 97
## 2 01-01-16 95
## 3 12-25-15 65
## 4 04-01-16 18
## 5 12-31-15 10
On each of the outlying days for high tweet volume, there were significant news stories about Megabus. More on that below.
Like most sentiment analyses, we relied on a lexicon of positive and negative words. We started with the Bing lexicon in the sentiments dataset of the tidytext package (described in more detail here: https://www.cs.uic.edu/~liub/). This lexicon has been compiled over the past 12 years by researchers at the University of Illinois at Chicago.
After an initial review of the Bing lexicon, we realized that we needed to supplement the list of positive and negative words in order to capture the full range of Megabus transportation-specific sentiments. We added a total of 57 sentiments associated with the Megabus and transportation experience. These additional negative and positive words were identified by manually reviewing a random sample from 2,900 tweets containing the keyword “megabus,” queried on 4/20/2016, following methods similar to those described here: http://www.wired.com/2015/02/best-worst-public-transit-systems-according-twitter/. This review, comparing the Bing lexicon with Megabus tweets, also revealed that a few sentiments already in the Bing lexicon needed to be recoded or removed. These included changing “uneventful” from negative to positive, reversing “cheap” from negative to positive given Megabus’s platform as a cheap transportation provider, and removing “like,” which could be positive or negative and is probably slang.
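The recoding step can be sketched as follows (using a toy three-word lexicon rather than the full Bing list):

```r
library(dplyr)

# Toy stand-in for the Bing lexicon, with its default sentiment labels.
lexicon <- data.frame(
  word      = c("uneventful", "cheap", "like"),
  sentiment = c("negative", "negative", "positive"),
  lexicon   = "bing",
  stringsAsFactors = FALSE
)

lexicon <- lexicon %>%
  # Flip words whose meaning differs in a budget-bus context.
  mutate(sentiment = ifelse(word %in% c("uneventful", "cheap"),
                            "positive", sentiment)) %>%
  # Drop ambiguous slang that could be positive or negative.
  filter(word != "like")
```

The same mutate()/filter() pattern extends to the full lexicon, with the supplemental "megabummer" words simply appended as extra rows.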
by_word <- tweets_df_all %>%
dplyr::select(text, id, date, date2, time, weekend, weekend_binary, month) %>%
unnest_tokens(word, text)
# look at most commonly tweeted words
by_word_count <- by_word %>%
count(word, sort = TRUE)
head(by_word_count)
## Source: local data frame [6 x 2]
##
## word n
## (chr) (int)
## 1 megabus 95116
## 2 the 51102
## 3 to 43150
## 4 a 34032
## 5 on 30676
## 6 i 25996
megabus_lexicon <- read_csv("megabus_lexicon.csv")
# create new dataframe of bing and megabummer sentiments
bing_megabus <- megabus_lexicon %>%
filter(lexicon %in% c("bing","megabummer")) %>%
dplyr::select(-score)
head(bing_megabus %>% filter(lexicon=="megabummer"))
## Source: local data frame [6 x 4]
##
## NA word sentiment lexicon
## (int) (chr) (chr) (chr)
## 1 20207 uneventful positive megabummer
## 2 23166 working negative megabummer
## 3 23167 cried negative megabummer
## 4 23168 neveragain negative megabummer
## 5 23169 crying negative megabummer
## 6 23170 pee negative megabummer
# join tweets with sentiment and add score column
mb_sentiment <- by_word %>%
inner_join(bing_megabus) %>%
mutate(score = ifelse(sentiment == "positive", 1, -1))
## Joining by: "word"
head(mb_sentiment %>% dplyr::select(id,word,sentiment,score))
## Source: local data frame [6 x 4]
##
## id word sentiment score
## (dbl) (chr) (chr) (dbl)
## 1 550802761462603776 smash negative -1
## 2 550798250463473664 working negative -1
## 3 550798250463473664 love positive 1
## 4 550798250463473664 work positive 1
## 5 550797004344152064 struggle negative -1
## 6 550797004344152064 waiting negative -1
# calculate score for each tweet
dt <- data.table(mb_sentiment)
#build data set for sentiment analysis containing data on each tweet including sentiment score
mb_sentiment_tweet <- unique(dt[,list(score_tweet = sum(score), freq = .N, date, weekend_binary, date2, weekend, month), by = c("id")] )
tweets_df_all_joiner <- tweets_df_all %>% dplyr::select(id,text)
mb_sentiment_tweet <- left_join(mb_sentiment_tweet,data.table(tweets_df_all_joiner),by="id")
head(mb_sentiment_tweet)
## id score_tweet freq date weekend_binary
## 1 550520819819220992 0 2 2015-01-01 00:16:06 0
## 2 550522977519824896 2 2 2015-01-01 00:24:40 0
## 3 550525690983239680 1 1 2015-01-01 00:35:27 0
## 4 550536610228731904 -2 2 2015-01-01 01:18:51 0
## 5 550574314970697664 0 2 2015-01-01 03:48:40 0
## 6 550579369366663168 -1 1 2015-01-01 04:08:45 0
## date2 weekend month
## 1 01-01-15 Thursday January
## 2 01-01-15 Thursday January
## 3 01-01-15 Thursday January
## 4 01-01-15 Thursday January
## 5 01-01-15 Thursday January
## 6 01-01-15 Thursday January
## text
## 1 @megabus regarding bad weather, whats your ticket refund policy? As my bus from DC to Toronto is en route back to DC after philadelphia stop
## 2 Photo clearly shows bus listing to the right. Megabus on Borman Expressway taken out of service, police say: http:// bit.ly/14eeRha via @nwi
## 3 My bus ticket home from the Toronto airport for Pax East is $1. I love Megabus .
## 4 So anyway megabus is a group of shit eaters never again
## 5 Best exchange of 2014 on a Megabus @kaoskelley : [laughing] "Come on, man." @SoxHawksNBulls: [angrily] " 'Come on' my ass!!!....Goddamn it."
## 6 Megabus Problems Continue After Bus Placed Out of Service in Lake County #Indianapolis http:// bit.ly/1ENSmzF
#Creating data table of calendar dates, including weekend status, day of week (column name weekend), month, and tweet frequency and sentiment
mb_sentiment_date <- unique(mb_sentiment_tweet[,list(score_date = round(mean(score_tweet),2), freq = .N, weekend_binary, weekend, month), by = c("date2")] )
mb_sentiment_date <- mb_sentiment_date %>% filter(freq<500)
head(mb_sentiment_date)
## date2 score_date freq weekend_binary weekend month
## 1 01-01-15 -0.4699999999999999733546 114 0 Thursday January
## 2 01-02-15 -0.4299999999999999933387 174 1 Friday January
## 3 01-03-15 -0.4199999999999999844569 106 1 Saturday January
## 4 01-04-15 -0.2500000000000000000000 115 1 Sunday January
## 5 01-05-15 -0.1900000000000000022204 102 0 Monday January
## 6 01-06-15 -0.3200000000000000066613 106 0 Tuesday January
#Creating data table of calendar dates and tweet frequency and sentiment with holiday status (including federal holidays, Valentine's Day, and Halloween)
holiday_dates <- c("01-01-15", "01-19-15", "02-14-15", "02-16-15",
                   "05-25-15", "09-07-15", "10-12-15", "10-31-15",
                   "11-11-15", "11-26-15", "12-25-15", "01-01-16",
                   "01-18-16", "02-14-16", "02-15-16")
mb_sentiment_holidays <- mb_sentiment_date %>%
  mutate(holiday = ifelse(date2 %in% holiday_dates, 1, 0))
head(mb_sentiment_holidays)
## date2 score_date freq weekend_binary weekend month
## 1 01-01-15 -0.4699999999999999733546 114 0 Thursday January
## 2 01-02-15 -0.4299999999999999933387 174 1 Friday January
## 3 01-03-15 -0.4199999999999999844569 106 1 Saturday January
## 4 01-04-15 -0.2500000000000000000000 115 1 Sunday January
## 5 01-05-15 -0.1900000000000000022204 102 0 Monday January
## 6 01-06-15 -0.3200000000000000066613 106 0 Tuesday January
## holiday
## 1 1
## 2 0
## 3 0
## 4 0
## 5 0
## 6 0
The describe() function in Hmisc helped us see how tweet sentiment scores are distributed. In particular, we noticed that the distribution of by-tweet sentiment scores was similar to the distribution of by-date sentiment scores (the average sentiment score for each day). We therefore focused our analysis on the by-date sentiment scores because the daily bin is convenient for looking at trends over time and for moderating the considerable variance. The by-date sentiment score gives a sense of the overall level of sentiment on any given day, which is what we ultimately decided to use as our main outcome.
describe(mb_sentiment_tweet)
## mb_sentiment_tweet
##
## 9 Variables 51683 Observations
## ---------------------------------------------------------------------------
## id
## n missing unique Info Mean .05 .10
## 51683 0 51682 1 6.29e+17 5.591e+17 5.679e+17
## .25 .50 .75 .90 .95
## 5.887e+17 6.248e+17 6.666e+17 6.999e+17 7.076e+17
##
## lowest : 5.505e+17 5.505e+17 5.505e+17 5.505e+17 5.506e+17
## highest: 7.158e+17 7.158e+17 7.158e+17 7.158e+17 7.158e+17
## ---------------------------------------------------------------------------
## score_tweet
## n missing unique Info Mean .05 .10 .25 .50
## 51683 0 18 0.92 -0.3476 -2 -2 -1 -1
## .75 .90 .95
## 1 1 2
##
## -10 -9 -8 -7 -6 -5 -4 -3 -2 -1 0 1 2 3 4
## Frequency 1 1 2 1 26 91 406 1750 6606 19545 5917 13307 3246 641 122
## % 0 0 0 0 0 0 1 3 13 38 11 26 6 1 0
## 5 6 7
## Frequency 18 2 1
## % 0 0 0
## ---------------------------------------------------------------------------
## freq
## n missing unique Info Mean .05 .10 .25 .50
## 51683 0 10 0.79 1.623 1 1 1 1
## .75 .90 .95
## 2 3 3
##
## 1 2 3 4 5 6 7 8 9 10
## Frequency 29619 14681 5291 1568 419 72 23 6 3 1
## % 57 28 10 3 1 0 0 0 0 0
## ---------------------------------------------------------------------------
## date
## n missing unique
## 51683 0 51599
## Info Mean .05
## 1 2015-08-05 11:40:21 2015-01-24 17:03:02
## .10 .25 .50
## 2015-02-17 22:02:03 2015-04-16 08:34:12 2015-07-24 22:24:34
## .75 .90 .95
## 2015-11-17 11:09:19 2016-02-17 05:38:50 2016-03-09 08:11:50
##
## lowest : 2015-01-01 00:16:06 2015-01-01 00:24:40 2015-01-01 00:35:27 2015-01-01 01:18:51 2015-01-01 03:48:40
## highest: 2016-04-01 01:16:55 2016-04-01 02:04:43 2016-04-01 02:45:24 2016-04-01 03:36:45 2016-04-01 03:40:08
## ---------------------------------------------------------------------------
## weekend_binary
## n missing unique Info Sum Mean
## 51683 0 2 0.75 24522 0.4745
## ---------------------------------------------------------------------------
## date2
## n missing unique
## 51683 0 457
##
## lowest : 01-01-15 01-01-16 01-02-15 01-02-16 01-03-15
## highest: 12-27-15 12-28-15 12-29-15 12-30-15 12-31-15
## ---------------------------------------------------------------------------
## weekend
## n missing unique
## 51683 0 7
##
## Friday Monday Saturday Sunday Thursday Tuesday Wednesday
## Frequency 10047 7871 6373 8102 7307 5857 6126
## % 19 15 12 16 14 11 12
## ---------------------------------------------------------------------------
## month
## n missing unique
## 51683 0 12
##
## April August December February January July June March May
## Frequency 4291 3706 2597 6420 5489 4487 3910 7161 3938
## % 8 7 5 12 11 9 8 14 8
## November October September
## Frequency 3182 3448 3054
## % 6 7 6
## ---------------------------------------------------------------------------
## text
## n missing unique
## 51683 0 51683
##
## lowest : _ fuck you Chicago Amtrak agents. FUCK YOU. (@Megabus Chicago to Ann Arbor in Chicago, IL) https://www. swarmapp.com/c/631H6tDGuxb - " Is that the MeggggaaaBus!???" - " Yes that's the MegggaaaBus!!! " Sun Valley guys kill me . Saluting the MegaBus Ai #ResLife - #tuu i #yoo - 19 Injured in Indiana Crash Involving Megabus on I-65: Indiana State Police say 19 injured ... http:// abcn.ws/1FFnP7h - I love megabus bro , my trip to Chicago there & back $26.50 . S/o to them - Kim gon' make me lose my mind. *books Megabus ticket*
## highest: Zero sleep, back pain, and a stranger practically spooning you. Always a fun night on the megabus . Zoe just turned Yeah by Usher off and am off to get a megabus me no longer friends not worth it Zola must have a heart of gold. My black ass would've on been on the first Megabus out of town. ZOMG - a @BoltBus exploded! @Megabus for the win. Zoo the megabus King. Now he don't need a gofundme to take these trips. He got a good job now
## ---------------------------------------------------------------------------
describe(mb_sentiment_date)
## mb_sentiment_date
##
## 6 Variables 455 Observations
## ---------------------------------------------------------------------------
## date2
## n missing unique
## 455 0 455
##
## lowest : 01-01-15 01-01-16 01-02-15 01-02-16 01-03-15
## highest: 12-27-15 12-28-15 12-29-15 12-30-15 12-31-15
## ---------------------------------------------------------------------------
## score_date
## n missing unique Info Mean .05 .10 .25 .50
## 455 0 89 1 -0.3162 -0.623 -0.546 -0.430 -0.320
## .75 .90 .95
## -0.190 -0.074 0.000
##
## lowest : -1.12 -0.95 -0.86 -0.77 -0.76
## highest: 0.10 0.14 0.16 0.19 0.27
## ---------------------------------------------------------------------------
## freq
## n missing unique Info Mean .05 .10 .25 .50
## 455 0 141 1 110.4 61 70 86 104
## .75 .90 .95
## 129 162 181
##
## lowest : 8 35 48 49 50, highest: 227 230 253 273 284
## ---------------------------------------------------------------------------
## weekend_binary
## n missing unique Info Sum Mean
## 455 0 2 0.73 195 0.4286
## ---------------------------------------------------------------------------
## weekend
## n missing unique
## 455 0 7
##
## Friday Monday Saturday Sunday Thursday Tuesday Wednesday
## Frequency 66 64 65 64 66 65 65
## % 15 14 14 14 15 14 14
## ---------------------------------------------------------------------------
## month
## n missing unique
## 455 0 12
##
## April August December February January July June March May
## Frequency 30 31 31 56 62 31 30 62 31
## % 7 7 7 12 14 7 7 14 7
## November October September
## Frequency 30 31 30
## % 7 7 7
## ---------------------------------------------------------------------------
describe(mb_sentiment_holidays)
## mb_sentiment_holidays
##
## 7 Variables 455 Observations
## ---------------------------------------------------------------------------
## date2
## n missing unique
## 455 0 455
##
## lowest : 01-01-15 01-01-16 01-02-15 01-02-16 01-03-15
## highest: 12-27-15 12-28-15 12-29-15 12-30-15 12-31-15
## ---------------------------------------------------------------------------
## score_date
## n missing unique Info Mean .05 .10 .25 .50
## 455 0 89 1 -0.3162 -0.623 -0.546 -0.430 -0.320
## .75 .90 .95
## -0.190 -0.074 0.000
##
## lowest : -1.12 -0.95 -0.86 -0.77 -0.76
## highest: 0.10 0.14 0.16 0.19 0.27
## ---------------------------------------------------------------------------
## freq
## n missing unique Info Mean .05 .10 .25 .50
## 455 0 141 1 110.4 61 70 86 104
## .75 .90 .95
## 129 162 181
##
## lowest : 8 35 48 49 50, highest: 227 230 253 273 284
## ---------------------------------------------------------------------------
## weekend_binary
## n missing unique Info Sum Mean
## 455 0 2 0.73 195 0.4286
## ---------------------------------------------------------------------------
## weekend
## n missing unique
## 455 0 7
##
## Friday Monday Saturday Sunday Thursday Tuesday Wednesday
## Frequency 66 64 65 64 66 65 65
## % 15 14 14 14 15 14 14
## ---------------------------------------------------------------------------
## month
## n missing unique
## 455 0 12
##
## April August December February January July June March May
## Frequency 30 31 31 56 62 31 30 62 31
## % 7 7 7 12 14 7 7 14 7
## November October September
## Frequency 30 31 30
## % 7 7 7
## ---------------------------------------------------------------------------
## holiday
## n missing unique Info Sum Mean
## 455 0 2 0.1 15 0.03297
## ---------------------------------------------------------------------------
Looking at a line graph of the sentiment score over time is not particularly useful: it is just a blur, with much more variability than any clear trend.
ggplot(data=mb_sentiment_tweet, aes(x=date, y=score_tweet)) +
geom_line()
However, with smoothing, ebbs and flows over time become clear. Though there is much variance day to day, smoothing reveals broad seasonal trends. Smoothing is appropriate for tweet volume and sentiment because the outcome is continuous.
options(digits = 3)
ggplot(data=mb_sentiment_tweet, aes(x=date, y=score_tweet)) +
geom_smooth()
options(digits = 22)
This chart shows the distribution of daily sentiment scores. Most days’ tweets skewed negative, with few days boasting an average tweet score above 0.
options(digits = 3)
ggplot(data=mb_sentiment_date, aes(score_date)) +
geom_histogram(binwidth = 0.1)
options(digits = 22)
We first tested the relationship between tweet volume and tweet sentiment. Based on our data exploration, we expected there to be a correlation, and there was. A linear regression was appropriate here because, while we are uncertain about causality (and indeed, the direction of causality may not even be consistent), we had reason to believe that there was a relationship between tweet volume and sentiment.
#Hyp. #1: tweet sentiment on low volume days = on high volume days
h1.lm <- lm(score_date ~ freq, data = mb_sentiment_date)
summary(h1.lm)
##
## Call:
## lm(formula = score_date ~ freq, data = mb_sentiment_date)
##
## Residuals:
## Min 1Q
## -0.9189687986304141409377 -0.1151670734170468085589
## Median 3Q
## -0.0006202760743232015733 0.1307781235143221643114
## Max
## 0.5373698718535243612493
##
## Coefficients:
## Estimate Std. Error
## (Intercept) -0.1920361274056618106076 0.0266103300080770439895
## freq -0.0011243842454904783967 0.0002280053451030171908
## t value Pr(>|t|)
## (Intercept) -7.216599999999999682 2.2613e-12 ***
## freq -4.931390000000000384 1.1481e-06 ***
## ---
## Signif. codes:
## 0 '***' 0.001000000000000000020817 '**' 0.01000000000000000020817 '*'
## 0.05000000000000000277556 '.' 0.1000000000000000055511 ' ' 1
##
## Residual standard error: 0.1837266566495620589983 on 453 degrees of freedom
## Multiple R-squared: 0.05094845799962740424, Adjusted R-squared: 0.04885342148307036947
## F-statistic: 24.31864914858560311 on 1 and 453 DF, p-value: 1.148076392352374159e-06
options(digits = 3)
#REJECT THE NULL (p<0.001): conclude that tweet sentiment on low-volume days > tweet sentiment on high-volume days
# The volume of tweets on a given day is a statistically significant predictor of the average daily sentiment score; for every additional tweet, we would expect about a 0.0011 decrease in the average daily sentiment score.
ggplot(data=mb_sentiment_date, aes(x=freq, y=score_date)) +
geom_line() + xlab("Number of megabus tweets") +
ylab("Tweet sentiment score") +
ggtitle("Tweet sentiment as a function of tweet volume") +
theme_hc()
options(digits = 22)
#Commentary:
#While we have found a highly statistically significant trend, the effect is moderate considering the degree of variance in tweet sentiment. In order to explore the meaning and practical application of the relationship, we would need to further examine days with different sentiments of tweets and consider whether the difference as measured by our scoring system is meaningful.
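A back-of-the-envelope calculation with the fitted slope helps gauge that practical magnitude (the 5th and 95th percentile tweet counts, 61 and 181, come from the describe() output above):

```r
# Fitted slope from the regression above.
slope <- -0.0011243842

# Predicted sentiment gap between a light day (~61 tweets, 5th percentile)
# and a heavy day (~181 tweets, 95th percentile):
gap <- slope * (181 - 61)
round(gap, 3)  # about -0.135
```

A gap of roughly 0.14 points is noticeable but modest next to the day-to-day spread of daily scores, which ranged from about -1.12 to 0.27.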
#Test whether two key assumptions of linear regression are met: 1.) homoskedasticity and 2.) zero conditional mean (the expected value of the error (residual) for any value of X is 0)
score_date.res = resid(h1.lm)
plot(mb_sentiment_date$freq, score_date.res,
ylab="Residuals", xlab="Number of Tweets",
main="Tweets and Megabus Sentiment")
abline(0,0)
ncvTest(h1.lm)
## Non-constant Variance Score Test
## Variance formula: ~ fitted.values
## Chisquare = 8.910717201496140660311 Df = 1 p = 0.002835019018073878151881
# With p = 0.003, we reject the null hypothesis of homoskedastic errors.
#There is evidence of heteroskedasticity (the variance in y is not constant across all values of X), so while the coefficient estimates remain unbiased, the standard errors should be interpreted with some caution.
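Given the small ncvTest p-value above, heteroskedasticity-robust standard errors are one way to cross-check the inference. As a sketch on toy data (not part of our original analysis), a White-type (HC0) sandwich estimator can be computed directly from the design matrix and residuals:

```r
# Toy regression with heteroskedastic noise, standing in for h1.lm.
set.seed(1)
x   <- 1:100
y   <- 0.5 * x + rnorm(100, sd = x / 10)
fit <- lm(y ~ x)

# White (HC0) sandwich estimator: (X'X)^-1 X' diag(e^2) X (X'X)^-1
X     <- model.matrix(fit)
e     <- resid(fit)
bread <- solve(t(X) %*% X)
meat  <- t(X) %*% diag(e^2) %*% X
robust_se <- sqrt(diag(bread %*% meat %*% bread))

# Compare with the model-based standard errors:
naive_se <- sqrt(diag(vcov(fit)))
```

If the robust and model-based standard errors diverge substantially, the significance conclusions should be re-checked using the robust versions.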
# Zero conditional mean and normality of residuals
#As demonstrated in the QQ plot below of the studentized residuals, the linearity of the points suggests that the residuals are approximately normally distributed with a mean of 0, supporting the normality assumption.
qqPlot(h1.lm, main="QQ Plot")
# distribution of studentized residuals
# We can also overlay a normal density curve on a histogram of the studentized residuals to visually check the normality assumption.
sresid <- studres(h1.lm)
hist(sresid, freq=FALSE,
main="Distribution of Studentized Residuals")
xfit<-seq(min(sresid),max(sresid),length=40)
yfit<-dnorm(xfit)
lines(xfit, yfit)
We then tested hypotheses about the relationship between average daily tweet sentiment and unit of time (day of the week, month of the year). We included frequency (volume) as a covariate since it has been established that there is a relationship between tweet volume and sentiment.
#Hyp. 2: sentiment ~ day of the week, adjusting for freq
fit <- lm(score_date ~ weekend + freq, data=mb_sentiment_date)
summary(fit)
##
## Call:
## lm(formula = score_date ~ weekend + freq, data = mb_sentiment_date)
##
## Residuals:
## Min 1Q Median
## -0.913941886353998311776 -0.111625442852886419542 0.004439338688796008216
## 3Q Max
## 0.129110505028944805295 0.522736248833435168670
##
## Coefficients:
## Estimate Std. Error
## (Intercept) -0.197737773694860202722 0.046334250929689776988
## weekendMonday -0.008937570568028744417 0.034133533800613652387
## weekendSaturday -0.020136213788559419180 0.035153506314741143901
## weekendSunday -0.047368481944821627661 0.033489715597243174816
## weekendThursday 0.006367932829306290933 0.033798176420926417940
## weekendTuesday 0.014684371952231413394 0.036068987978912080139
## weekendWednesday 0.029142393965417769108 0.035579064528388416877
## freq -0.001040042493892622899 0.000265750493213265512
## t value Pr(>|t|)
## (Intercept) -4.2676400000000001000 2.4134e-05 ***
## weekendMonday -0.2618400000000000172 0.79356437
## weekendSaturday -0.5728100000000000414 0.56706263
## weekendSunday -1.4144200000000000106 0.15793514
## weekendThursday 0.1884099999999999941 0.85064034
## weekendTuesday 0.4071199999999999819 0.68411544
## weekendWednesday 0.8190899999999999848 0.41317254
## freq -3.9136099999999998111 0.00010509 ***
## ---
## Signif. codes:
## 0 '***' 0.001000000000000000020817 '**' 0.01000000000000000020817 '*'
## 0.05000000000000000277556 '.' 0.1000000000000000055511 ' ' 1
##
## Residual standard error: 0.1835202903712500610567 on 447 degrees of freedom
## Multiple R-squared: 0.06562125309623642599, Adjusted R-squared: 0.05098892372637886528
## F-statistic: 4.484675777693695409 on 7 and 447 DF, p-value: 7.660462459531428524e-05
#REJECT THE NULL (p<0.001). Conclusion: when adjusting for day of the week, the volume of tweets on a given day remains a statistically significant predictor of the average daily sentiment score. For every additional tweet, we would expect about a 0.001 decrease in the average daily sentiment score. In this model, the days of the week are not statistically associated with the average daily sentiment score.
#Hyp. #2a: sentiment ~ month when stratifying on freq.
fit <- lm(score_date ~ month + freq, data=mb_sentiment_date)
summary(fit) # show results
##
## Call:
## lm(formula = score_date ~ month + freq, data = mb_sentiment_date)
##
## Residuals:
## Min 1Q Median
## -0.965868217797633765720 -0.117529053489835902391 0.002146391612874949673
## 3Q Max
## 0.118670618259850868537 0.539413011313623513843
##
## Coefficients:
## Estimate Std. Error
## (Intercept) -0.145645450037130447241 0.043587299803447387969
## monthAugust -0.091087021719467600733 0.046378641187270988611
## monthDecember -0.127423079668042865675 0.046981049073580354036
## monthFebruary -0.063907719951761352006 0.041033598097158327511
## monthJanuary -0.010278990981057988321 0.040768929173624998685
## monthJuly -0.062103854699109992121 0.046969682078097263478
## monthJune -0.101764721770906341192 0.046906750042090480934
## monthMarch -0.043607322875666973405 0.040260284930094460132
## monthMay -0.046374001306828803448 0.046467078732858366208
## monthNovember 0.029826737327879979111 0.046785286539483314505
## monthOctober -0.110560705988141438905 0.046367709790152672467
## monthSeptember -0.055365973160245676721 0.046844149142175384171
## freq -0.001060791520654452659 0.000248406171549808328
## t value Pr(>|t|)
## (Intercept) -3.3414700000000001623 0.00090414 ***
## monthAugust -1.9639899999999999025 0.05015796 .
## monthDecember -2.7122199999999998532 0.00694409 **
## monthFebruary -1.5574500000000000011 0.12007971
## monthJanuary -0.2521300000000000208 0.80105928
## monthJuly -1.3222100000000001074 0.18678153
## monthJune -2.1695099999999998275 0.03057610 *
## monthMarch -1.0831299999999999262 0.27933902
## monthMay -0.9979999999999999982 0.31882689
## monthNovember 0.6375199999999999756 0.52411376
## monthOctober -2.3844300000000000495 0.01752634 *
## monthSeptember -1.1819200000000000816 0.23787346
## freq -4.2703899999999999082 2.3903e-05 ***
## ---
## Signif. codes:
## 0 '***' 0.001000000000000000020817 '**' 0.01000000000000000020817 '*'
## 0.05000000000000000277556 '.' 0.1000000000000000055511 ' ' 1
##
## Residual standard error: 0.1810211834424542420852 on 442 degrees of freedom
## Multiple R-squared: 0.1010649338394147279, Adjusted R-squared: 0.07665945693007769624
## F-statistic: 4.141075964827750866 on 12 and 442 DF, p-value: 3.640325067888974723e-06
#REJECT THE NULL, conclusion: When we stratify on the volume of tweets in a given day, month is a statistically significant predictor of the average daily sentiment score in some cases: December, June, and October are significantly associated with average daily sentiment score. In December, for example, we would expect the average daily sentiment score to be 0.127 lower than in the reference month (April), holding tweet volume constant.
We then explored the relationship between weekend (Friday, Saturday, or Sunday) and tweet sentiment. First, we looked at the average tweet score for all tweets occurring on weekends vs. non-weekends. Then, we looked at the average of the daily tweet scores, each of which is the average sentiment for all tweets on that day. The first approach treats all tweets equally, while the second approach avoids overrepresenting days with many tweets, such as days when a significant event occurs.
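The two weighting schemes can be sketched directly from the data frames built earlier (a sketch, assuming `mb_sentiment_tweet` holds one row per tweet and `mb_sentiment_date` one row per day, as above):

```r
# Sketch of the two averaging approaches described above.
# 1) Tweet-level mean: every tweet counts equally, so high-volume days dominate.
mean(mb_sentiment_tweet$score_tweet)
# 2) Day-level mean: the average of daily averages, so every day counts equally.
mean(mb_sentiment_date$score_date)
```
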
#Hyp. #3a: sentiment weekend = sentiment weekday (each weekend day is weighted by tweet volume)
the_weekend = mb_sentiment_tweet %>% filter(weekend_binary == 1)
not_the_weekend = mb_sentiment_tweet %>% filter(weekend_binary == 0)
var.test(the_weekend$score_tweet,not_the_weekend$score_tweet) # fail to reject equal variances if p-value > 0.05
##
## F test to compare two variances
##
## data: the_weekend$score_tweet and not_the_weekend$score_tweet
## F = 0.97316, num df = 24521, denom df = 27160, p-value = 0.02904
## alternative hypothesis: true ratio of variances is not equal to 1
## 95 percent confidence interval:
## 0.94969 0.99722
## sample estimates:
## ratio of variances
## 0.97316
t.test(the_weekend$score_tweet,not_the_weekend$score_tweet) # Welch t-test; variances differ (var.test p < 0.05)
##
## Welch Two Sample t-test
##
## data: the_weekend$score_tweet and not_the_weekend$score_tweet
## t = -1.863, df = 51278, p-value = 0.06246
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
## -0.04684 0.00119
## sample estimates:
## mean of x mean of y
## -0.35960 -0.33677
#Conclusion: When comparing all weekend vs. non-weekend tweet scores as two large groups, the mean tweet score on weekends is not significantly different from the mean tweet score on non-weekends, *without* stratifying on tweet volume.
options(digits = 3)
# The following plot provides visual evidence that there is no discernible difference between tweet scores on weekdays vs. weekends.
ggplot(mb_sentiment_tweet, aes(x=weekend_binary, y=score_tweet, group=weekend_binary)) +
geom_boxplot(aes(fill=weekend_binary)) +
xlab("Non-weekend vs weekend") +
ylab("Tweet sentiment score") +
ggtitle("Variation in tweet sentiment, weekends v non-weekends") +
geom_jitter(colour="gray40",
position=position_jitter(width=0.2), alpha=0.3)
options(digits = 7) # restore default precision
#Hyp. #3b: sentiment weekend = sentiment weekday (each weekend day is weighted equally, without regard for tweet volume on that day)
the_weekend_date = mb_sentiment_date %>% filter(weekend_binary == 1)
not_the_weekend_date = mb_sentiment_date %>% filter(weekend_binary == 0)
var.test(the_weekend_date$score_date,not_the_weekend_date$score_date) # fail to reject equal variances if p-value > 0.05
##
## F test to compare two variances
##
## data: the_weekend_date$score_date and not_the_weekend_date$score_date
## F = 0.81375, num df = 194, denom df = 259, p-value = 0.129
## alternative hypothesis: true ratio of variances is not equal to 1
## 95 percent confidence interval:
## 0.62660 1.06215
## sample estimates:
## ratio of variances
## 0.81375
t.test(the_weekend_date$score_date,not_the_weekend_date$score_date) # Welch t-test; var.equal = TRUE would also be defensible here
##
## Welch Two Sample t-test
##
## data: the_weekend_date$score_date and not_the_weekend_date$score_date
## t = -3.164, df = 437.82, p-value = 0.001664
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
## -0.08935 -0.02088
## sample estimates:
## mean of x mean of y
## -0.34769 -0.29258
#Conclusion: The mean of daily tweet scores on weekends is significantly lower than the mean of daily tweet scores on non-weekends, *without* stratifying on tweet volume.
ggplot(mb_sentiment_date, aes(x=weekend_binary, y=score_date, group=weekend_binary)) +
geom_boxplot(aes(fill=weekend_binary)) +
xlab("Non-weekend vs. weekend") +
ylab("Tweet sentiment score") +
ggtitle("Variation in tweet sentiment between weekends and non-weekends") +
geom_jitter(colour="gray40",
position=position_jitter(width=0.2), alpha=0.3)
For the following test, we examined the association between the average sentiment score for tweet-days and weekend status. We performed the same test as above but added tweet volume as a covariate.
#Hyp. #4: sentiment weekend = sentiment weekday when stratifying on freq.
#[multiple linear regression]
fit <- lm(score_date ~ weekend_binary + freq, data=mb_sentiment_date)
summary(fit) # show results
##
## Call:
## lm(formula = score_date ~ weekend_binary + freq, data = mb_sentiment_date)
##
## Residuals:
##      Min       1Q   Median       3Q      Max
## -0.88710 -0.11384 -0.00058  0.13666  0.52841
##
## Coefficients:
##                 Estimate Std. Error t value Pr(>|t|)
## (Intercept)    -0.191403   0.026542  -7.211 2.35e-12 ***
## weekend_binary -0.033495   0.018100  -1.851   0.0649 .
## freq           -0.001000   0.000237  -4.218 2.98e-05 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.1832 on 452 degrees of freedom
## Multiple R-squared:  0.0581, Adjusted R-squared:  0.0539
## F-statistic: 13.94 on 2 and 452 DF,  p-value: 1.339e-06
#DO NOT HAVE ENOUGH EVIDENCE TO REJECT THE NULL, conclusion: The weekend is not a statistically significant predictor of average daily sentiment score when we stratify on tweet volume.
One of our earliest hypotheses was that tweeting would increase on weekends. We conducted a t-test to test this hypothesis.
#Hyp. #5: tweet volume weekend = tweet volume weekday
var.test(the_weekend_date$freq,not_the_weekend_date$freq)
##
## F test to compare two variances
##
## data: the_weekend_date$freq and not_the_weekend_date$freq
## F = 1.6115, num df = 194, denom df = 259, p-value = 0.0003436
## alternative hypothesis: true ratio of variances is not equal to 1
## 95 percent confidence interval:
## 1.2409 2.1034
## sample estimates:
## ratio of variances
## 1.6115
t.test(the_weekend_date$freq,not_the_weekend_date$freq)
##
## Welch Two Sample t-test
##
## data: the_weekend_date$freq and not_the_weekend_date$freq
## t = 6.0775, df = 358.44, p-value = 3.124e-09
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
## 14.623 28.613
## sample estimates:
## mean of x mean of y
## 122.78 101.16
#REJECT THE NULL (STRONGLY), conclusion: tweet volume on weekends is significantly greater than tweet volume on weekdays.
ggplot(mb_sentiment_date, aes(x=weekend_binary, y=freq, group=weekend_binary)) +
geom_boxplot(aes(fill=weekend_binary)) +
xlab("Non-weekend vs. weekend") +
ylab("Tweet volume") +
ggtitle("Variation in tweet volume between weekends and non-weekends") +
geom_jitter(colour="gray40",
position=position_jitter(width=0.2), alpha=0.3)
We also conducted a t-test to test the hypothesis that tweet sentiment would be significantly different on holidays (federal holidays, Valentine’s Day, and Halloween). This test was inspired by an early observation that there appeared to be many more tweets on Valentine’s Day than on an average mid-February day. (More on that later.)
#Hyp. #6: sentiment holiday = sentiment !holiday
#Without stratifying on volume
holiday = mb_sentiment_holidays %>% filter(holiday == 1)
not_holiday = mb_sentiment_holidays %>% filter(holiday == 0)
var.test(holiday$score_date,not_holiday$score_date) # fail to reject equal variances if p-value > 0.05
##
## F test to compare two variances
##
## data: holiday$score_date and not_holiday$score_date
## F = 1.778, num df = 14, denom df = 439, p-value = 0.07877
## alternative hypothesis: true ratio of variances is not equal to 1
## 95 percent confidence interval:
## 0.93788 4.45426
## sample estimates:
## ratio of variances
## 1.778
t.test(holiday$score_date,not_holiday$score_date,var.equal = TRUE)
##
## Two Sample t-test
##
## data: holiday$score_date and not_holiday$score_date
## t = 0.4776, df = 453, p-value = 0.6332
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
## -0.07364 0.12093
## sample estimates:
## mean of x mean of y
## -0.29333 -0.31698
#There is no statistically significant association between holiday and tweet sentiment when we do not stratify on volume.
#Stratifying on volume
#[multiple linear regression]
fit <- lm(score_date ~ holiday + freq, data=mb_sentiment_holidays)
summary(fit) # show results
##
## Call:
## lm(formula = score_date ~ holiday + freq, data = mb_sentiment_holidays)
##
## Residuals:
##      Min       1Q   Median       3Q      Max
## -0.91840 -0.11491 -0.00017  0.13106  0.53778
##
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)
## (Intercept) -0.192623   0.026825  -7.181 2.87e-12 ***
## holiday      0.008997   0.048384   0.186    0.853
## freq        -0.001122   0.000229  -4.905 1.30e-06 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.1839 on 452 degrees of freedom
## Multiple R-squared:  0.0510, Adjusted R-squared:  0.0468
## F-statistic: 12.15 on 2 and 452 DF,  p-value: 7.244e-06
#When we stratify on tweet volume, we still see that there is no statistically significant association between holiday and tweet sentiment.
options(digits = 3)
ggplot(mb_sentiment_holidays, aes(x=holiday, y=score_date, group=holiday)) +
geom_boxplot(aes(fill=holiday)) +
xlab("Non-holiday vs. holiday") +
ylab("Tweet sentiment score") +
ggtitle("Variation in tweet sentiment score between holidays and non-holidays") +
geom_jitter(colour="gray40",
position=position_jitter(width=0.2), alpha=0.3)
options(digits = 7) # restore default precision
We built word clouds and word webs to convey the frequency and co-occurrence of words.
#Positive and negative word data sets.
positives <- bing_megabus %>%
filter(sentiment == "positive") %>%
dplyr::select(word)
negatives = bing_megabus %>%
filter(sentiment == "negative") %>%
dplyr::select(word)
head(by_word)
## Source: local data frame [6 x 8]
##
## id date date2 time weekend
## (dbl) (time) (chr) (chr) (chr)
## 1 550802963749281792 2015-01-01 18:57:14 01-01-15 18:57:14 Thursday
## 2 550802963749281792 2015-01-01 18:57:14 01-01-15 18:57:14 Thursday
## 3 550802963749281792 2015-01-01 18:57:14 01-01-15 18:57:14 Thursday
## 4 550802963749281792 2015-01-01 18:57:14 01-01-15 18:57:14 Thursday
## 5 550802963749281792 2015-01-01 18:57:14 01-01-15 18:57:14 Thursday
## 6 550802963749281792 2015-01-01 18:57:14 01-01-15 18:57:14 Thursday
## Variables not shown: weekend_binary (dbl), month (chr), word (chr)
word_list <- by_word %>% dplyr::select(word)
head(word_list)
## Source: local data frame [6 x 1]
##
## word
## (chr)
## 1 wlzkhallfa
## 2 let
## 3 me
## 4 fly
## 5 you
## 6 greyhound
word_list_negatives <- subset(word_list, word %in% negatives$word)
head(word_list_negatives)
## Source: local data frame [6 x 1]
##
## word
## (chr)
## 1 smash
## 2 working
## 3 struggle
## 4 waiting
## 5 never
## 6 wait
word_list_positives <- subset(word_list, word %in% positives$word)
head(word_list_positives)
## Source: local data frame [6 x 1]
##
## word
## (chr)
## 1 love
## 2 work
## 3 comfortable
## 4 saver
## 5 gold
## 6 good
# Negative Word Cloud
word_list_negatives <- Corpus(VectorSource(word_list_negatives))
inspect(word_list_negatives)
## <<VCorpus>>
## Metadata: corpus specific: 0, document level (indexed): 0
## Content: documents: 1
##
## $word
## <<PlainTextDocument>>
## Metadata: 7
## Content: chars: 295094
toSpace <- content_transformer(function (x , pattern ) gsub(pattern, " ", x))
word_list_negatives <- tm_map(word_list_negatives, toSpace, "/")
word_list_negatives <- tm_map(word_list_negatives, toSpace, "@")
word_list_negatives <- tm_map(word_list_negatives, toSpace, "\\|")
#Build a term-document matrix
dtm <- TermDocumentMatrix(word_list_negatives)
m <- as.matrix(dtm)
v <- sort(rowSums(m),decreasing=TRUE)
d <- data.frame(word = names(v),freq=v)
head(d, 10)
## word freq
## never never 2382
## still still 1658
## waiting waiting 1569
## shit shit 1005
## fuck fuck 960
## crash crash 943
## fucking fucking 879
## bad bad 855
## wait wait 833
## worst worst 830
# Word Cloud (Negative)
set.seed(1234)
wordcloud(words = d$word, freq = d$freq, min.freq = 1,
max.words=125, random.order=FALSE, scale=c(4,0.5), rot.per=0.35,
colors=brewer.pal(8, "Dark2"))
##### Positive Word Cloud #####
word_list_positives <- Corpus(VectorSource(word_list_positives))
#inspect(word_list_positives)
toSpace <- content_transformer(function (x , pattern ) gsub(pattern, " ", x))
word_list_positives <- tm_map(word_list_positives, toSpace, "/")
word_list_positives <- tm_map(word_list_positives, toSpace, "@")
word_list_positives <- tm_map(word_list_positives, toSpace, "\\|")
word_list_positives <- tm_map(word_list_positives, removeWords, c("megabus", "the", "and", "https", "you", "t.co", "for", "this", "bus", "that"))
#Build a term-document matrix
dtm <- TermDocumentMatrix(word_list_positives)
m <- as.matrix(dtm)
v <- sort(rowSums(m),decreasing=TRUE)
d <- data.frame(word = names(v),freq=v)
head(d, 10)
## word freq
## good good 1688
## right right 1316
## love love 1246
## work work 1069
## cheap cheap 1056
## better better 1017
## free free 931
## well well 889
## great great 841
## gold gold 768
# Word Cloud (Positive)
set.seed(1234)
wordcloud(words = d$word, freq = d$freq, min.freq = 1,
max.words=125, random.order=FALSE, rot.per=0.35, scale=c(4,0.5),
colors=brewer.pal(8, "Dark2"))
Remember those outlying days when everyone was tweeting a lot? Now we’re ready to see what some of that tweeting was about.
April 13, 2015: 19 injured in Megabus crash on I-65 in Indiana
by_word <- tweets_df_all %>%
dplyr::select(text, id, date, date2, time, weekend, weekend_binary, month) %>%
unnest_tokens(word, text)
word_list <- by_word %>% dplyr::select(word, date2)
word_list <- subset(word_list, word %in% bing_megabus$word)
word_list_date <- word_list %>%
filter(date2=="04-13-15")
word_list_date <- word_list_date %>% dplyr::select(word)
word_list_date <- Corpus(VectorSource(word_list_date))
#inspect(word_list_date)
toSpace <- content_transformer(function (x , pattern ) gsub(pattern, " ", x))
word_list_date <- tm_map(word_list_date, toSpace, "/")
word_list_date <- tm_map(word_list_date, toSpace, "@")
word_list_date <- tm_map(word_list_date, toSpace, "\\|")
word_list_date <- tm_map(word_list_date, removeWords, c("megabus", "the", "and", "https", "you", "t.co", "for", "this", "bus", "that"))
#Build a term-document matrix
dtm <- TermDocumentMatrix(word_list_date)
m <- as.matrix(dtm)
v <- sort(rowSums(m),decreasing=TRUE)
d <- data.frame(word = names(v),freq=v)
# Word Cloud (Date)
set.seed(1234)
wordcloud(words = d$word, freq = d$freq, min.freq = 1,
max.words=125, random.order=FALSE, scale=c(6,0.5), rot.per=0.35,
colors=brewer.pal(8, "Dark2"))
May 13, 2015: ‘Hero’ passenger subdues gunman who may have tried to take over Megabus
by_word <- tweets_df_all %>%
dplyr::select(text, id, date, date2, time, weekend, weekend_binary, month) %>%
unnest_tokens(word, text)
word_list <- by_word %>% dplyr::select(word, date2)
word_list <- subset(word_list, word %in% bing_megabus$word)
word_list_date <- word_list %>%
filter(date2=="05-13-15")
word_list_date <- word_list_date %>% dplyr::select(word)
word_list_date <- Corpus(VectorSource(word_list_date))
#inspect(word_list_date)
toSpace <- content_transformer(function (x , pattern ) gsub(pattern, " ", x))
word_list_date <- tm_map(word_list_date, toSpace, "/")
word_list_date <- tm_map(word_list_date, toSpace, "@")
word_list_date <- tm_map(word_list_date, toSpace, "\\|")
word_list_date <- tm_map(word_list_date, removeWords, c("megabus", "the", "and", "https", "you", "t.co", "for", "this", "bus", "that"))
#Build a term-document matrix
dtm <- TermDocumentMatrix(word_list_date)
m <- as.matrix(dtm)
v <- sort(rowSums(m),decreasing=TRUE)
d <- data.frame(word = names(v),freq=v)
# Word Cloud (Date)
set.seed(1234)
wordcloud(words = d$word, freq = d$freq, min.freq = 1,
max.words=125, random.order=FALSE, scale=c(4,0.5), rot.per=0.35,
colors=brewer.pal(8, "Dark2"))
February 21, 2016: The Day My Megabus Caught Fire
by_word <- tweets_df_all %>%
dplyr::select(text, id, date, date2, time, weekend, weekend_binary, month) %>%
unnest_tokens(word, text)
word_list <- by_word %>% dplyr::select(word, date2)
word_list <- subset(word_list, word %in% bing_megabus$word)
word_list_date <- word_list %>%
filter(date2=="02-21-16")
word_list_date <- word_list_date %>% dplyr::select(word)
word_list_date <- Corpus(VectorSource(word_list_date))
#inspect(word_list_date)
toSpace <- content_transformer(function (x , pattern ) gsub(pattern, " ", x))
word_list_date <- tm_map(word_list_date, toSpace, "/")
word_list_date <- tm_map(word_list_date, toSpace, "@")
word_list_date <- tm_map(word_list_date, toSpace, "\\|")
word_list_date <- tm_map(word_list_date, removeWords, c("megabus", "the", "and", "https", "you", "t.co", "for", "this", "bus", "that"))
#Build a term-document matrix
dtm <- TermDocumentMatrix(word_list_date)
m <- as.matrix(dtm)
v <- sort(rowSums(m),decreasing=TRUE)
d <- data.frame(word = names(v),freq=v)
# Word Cloud (Date)
set.seed(1234)
wordcloud(words = d$word, freq = d$freq, min.freq = 1,
max.words=125, random.order=FALSE, scale=c(5,0.5), rot.per=0.35,
colors=brewer.pal(8, "Dark2"))
February 14, 2015 and 2016: Valentine’s Day
by_word <- tweets_df_all %>%
dplyr::select(text, id, date, date2, time, weekend, weekend_binary, month) %>%
unnest_tokens(word, text)
word_list <- by_word %>% dplyr::select(word, date2)
word_list <- subset(word_list, word %in% bing_megabus$word)
word_list_date <- word_list %>%
filter(date2=="02-14-16"|date2=="02-14-15")
word_list_date <- word_list_date %>% dplyr::select(word)
word_list_date <- Corpus(VectorSource(word_list_date))
#inspect(word_list_date)
toSpace <- content_transformer(function (x , pattern ) gsub(pattern, " ", x))
word_list_date <- tm_map(word_list_date, toSpace, "/")
word_list_date <- tm_map(word_list_date, toSpace, "@")
word_list_date <- tm_map(word_list_date, toSpace, "\\|")
word_list_date <- tm_map(word_list_date, removeWords, c("megabus", "the", "and", "https", "you", "t.co", "for", "this", "bus", "that"))
#Build a term-document matrix
dtm <- TermDocumentMatrix(word_list_date)
m <- as.matrix(dtm)
v <- sort(rowSums(m),decreasing=TRUE)
d <- data.frame(word = names(v),freq=v)
# Word Cloud (Date)
set.seed(1234)
wordcloud(words = d$word, freq = d$freq, min.freq = 1,
max.words=125, random.order=FALSE, scale=c(3,0.5), rot.per=0.35,
colors=brewer.pal(8, "Dark2"))
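The same pipeline is repeated above for each date of interest; it could be factored into a small helper. A sketch, assuming the `by_word` and `bing_megabus` objects and the tm/wordcloud/RColorBrewer packages loaded earlier (`plot_date_cloud` is a hypothetical name introduced here, not part of the original analysis):

```r
# Sketch: wrap the repeated per-date word-cloud pipeline in one function.
# Assumes by_word and bing_megabus exist as built above.
plot_date_cloud <- function(target_date, cloud_scale = c(4, 0.5)) {
  # Keep only sentiment-bearing words tweeted on the target date.
  words <- by_word %>%
    dplyr::filter(word %in% bing_megabus$word, date2 == target_date) %>%
    dplyr::select(word)
  corp <- Corpus(VectorSource(words))
  toSpace <- content_transformer(function(x, pattern) gsub(pattern, " ", x))
  for (p in c("/", "@", "\\|")) corp <- tm_map(corp, toSpace, p)
  corp <- tm_map(corp, removeWords,
                 c("megabus", "the", "and", "https", "you",
                   "t.co", "for", "this", "bus", "that"))
  # Term frequencies, sorted for the cloud.
  v <- sort(rowSums(as.matrix(TermDocumentMatrix(corp))), decreasing = TRUE)
  set.seed(1234)
  wordcloud(words = names(v), freq = v, min.freq = 1, max.words = 125,
            random.order = FALSE, scale = cloud_scale, rot.per = 0.35,
            colors = brewer.pal(8, "Dark2"))
}
# e.g. plot_date_cloud("04-13-15", cloud_scale = c(6, 0.5))
```
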
This is a map of words that occur in pairs in the entire data set.
# Word Co-occurrence
word_cooccurences <- by_word %>% dplyr::select(word, id, date2)
word_cooccurences <- subset(word_cooccurences, word %in% bing_megabus$word)
word_cooccurences <- word_cooccurences %>%
pair_count(id, word, sort = TRUE) %>%
dplyr::filter(n>25)
set.seed(2016)
word_cooccurences %>%
filter(n >= 20) %>%
graph_from_data_frame() %>%
ggraph(layout = "fr") +
geom_edge_link(aes(edge_alpha = n, edge_width = n), edge_colour="gray") +
geom_node_point(color = "lightblue", size = 5) +
geom_node_text(aes(label = name), vjust = 1.8, size=5) +
theme_void()
We observed a clear inverse correlation between tweet volume and tweet sentiment. That is, on days when there were more tweets, tweet sentiment tended to be lower. This association was very strong, with a highly significant p-value in all instances.
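This inverse relationship can also be quantified directly with a simple correlation test on the daily data (a sketch using the `mb_sentiment_date` data frame from above; the regressions earlier adjust for additional covariates):

```r
# Pearson correlation between daily tweet volume and average daily sentiment.
cor.test(mb_sentiment_date$freq, mb_sentiment_date$score_date)
```
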
We did not clearly identify a causal pathway for this association, but we developed two theories that provide a basis for future research. The first is that on days when people have bad Megabus experiences, they tend to tweet more, because people often tweet to complain. The second is that on days when there are more Megabuses on the road, there are naturally more tweets from the larger pool of passengers, and more Megabus activity carries a higher risk of breakdowns, delays, traffic jams, and other negative experiences.
It is entirely plausible that both of these effects are causing the association, or perhaps just one, or perhaps the association is by chance.
To test these theories, we would need data on Megabus activity, perhaps scraped from a travel site. We could then determine whether tweet activity is primarily an indicator of Megabus activity, a downstream consequence of negative Megabus experiences, or both.
Through our work, we hope to help people understand the trends in Megabus tweet sentiment so that they can make informed decisions and have the best Megabus experience possible!